For this data visualization, a dataset describing the gender distribution of college majors and major categories was obtained from the FiveThirtyEight GitHub repository. The dataset (women-stem.csv) lists college majors together with their broader major categories (Engineering, Physical Sciences, Computers & Mathematics, Health, Biology & Life Science), which are taken from Carnevale et al., “What’s It Worth?: The Economic Value of College Majors,” Georgetown University Center on Education and the Workforce, 2011, http://cew.georgetown.edu/whatsitworth. It also contains data from the American Community Survey 2010-2012 Public Use Microdata Series on the total number of men and women in each major during this period, as well as a column giving the proportion of women in each major (ShareWomen). A preview of the dataset is shown below:
df = read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/women-stem.csv')
head(df)
## Rank Major_code Major Major_category
## 1 1 2419 PETROLEUM ENGINEERING Engineering
## 2 2 2416 MINING AND MINERAL ENGINEERING Engineering
## 3 3 2415 METALLURGICAL ENGINEERING Engineering
## 4 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering
## 5 5 2418 NUCLEAR ENGINEERING Engineering
## 6 6 2405 CHEMICAL ENGINEERING Engineering
## Total Men Women ShareWomen Median
## 1 2339 2057 282 0.1205643 110000
## 2 756 679 77 0.1018519 75000
## 3 856 725 131 0.1530374 73000
## 4 1258 1123 135 0.1073132 70000
## 5 2573 2200 373 0.1449670 65000
## 6 32260 21239 11021 0.3416305 65000
To further analyze the gender distribution of the college major categories, a column containing the proportion of men in each major (ShareMen) was created. The dataset was then grouped by major category, and the average proportion of women and of men across the majors in each category was calculated as a percentage, creating the Total_Women and Total_Men columns, which were relabeled “Female Students” and “Male Students”, respectively. The final dataset used for the visualization contains only the Major_category column and these two columns, as produced by the code below.
library(htmlwidgets)
library('tidyverse')
## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library('plotly')
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggthemes)
library(forcats)
#Create ShareMen Column (Proportion of Men by Major)
df$ShareMen <- with(df,(Men / Total))
#DATA CLEANING: MAJOR CATEGORY BY GENDER (AVERAGE PROPORTION OF MEN vs. WOMEN by MAJOR CATEGORY)
df = df %>% group_by(Major_category) %>% summarise(Total_Women = (mean(ShareWomen)*100), Total_Men = (mean(ShareMen)*100))
names(df)[2] <- "Female Students"
names(df)[3] <- "Male Students"
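The summary above averages the per-major proportions, so a small major counts as much as a large one. A minimal sketch of how this unweighted mean can differ from a pooled share (total women divided by total students in the category) is given here. The `toy` data frame and the column names `mean_share_women` and `pooled_share_women` are hypothetical and are not part of women-stem.csv.

```r
library(dplyr)

# Hypothetical toy data (NOT taken from women-stem.csv): category "A" has one
# large, mostly male major and one small, mostly female major.
toy <- data.frame(
  Major_category = c("A", "A", "B"),
  Men            = c(900, 10, 50),
  Women          = c(100, 90, 150)
)
toy$Total      <- toy$Men + toy$Women
toy$ShareWomen <- toy$Women / toy$Total

shares <- toy %>%
  group_by(Major_category) %>%
  summarise(
    # Unweighted mean of per-major shares (the approach used in this analysis)
    mean_share_women   = mean(ShareWomen) * 100,
    # Pooled share: every student counts equally (an alternative, shown for comparison)
    pooled_share_women = sum(Women) / sum(Total) * 100
  )

print(shares)
```

In this toy example the two measures agree for category "B", which has a single major, but diverge for category "A", where the large major pulls the pooled share well below the unweighted mean. Which measure is preferable depends on whether the question concerns the typical major or the typical student.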
As shown below, women make up the majority of students in the health field, while the fields of engineering and computers & mathematics are largely dominated by male students.
#PLOT OF MAJOR CATEGORY BY GENDER (AVERAGE PROPORTION OF MEN vs. WOMEN)
knitr::opts_chunk$set(fig.width=10, fig.height=8)
p2 = ggplot(data = df %>% gather(Variable, value, -Major_category),
aes(x = reorder(Major_category, value), y = value, fill = Variable, some_dummy_mapping = value[Variable])) + geom_bar(stat = 'identity', position = 'stack')
p2 = p2 + theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
p2 = p2 + xlab("Major Category") + ylab("Percent") + ggtitle("Major Category by Gender") + geom_text(aes(label=sprintf("%0.2f", round(value, digits = 2))), position= position_stack(vjust = 0.85), size=3.5) + scale_fill_brewer(palette="Blues") + theme_minimal()+ theme(axis.text.x = element_text(angle = 15, hjust = 1)) + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.title = element_blank())
p2 = p2 + theme(plot.title = element_text(size=14, face="bold.italic"), axis.title.x = element_text(face="bold"),
axis.title.y = element_text(face="bold"))
p2
Each of these major categories consists of many majors. To examine which specific college majors have the highest share of female students and of male students, the data was subset to the 10 majors with the highest proportion of female students and the 10 majors with the highest proportion of male students. These datasets (named mostwomen and mostmen) and the corresponding bar plots of their gender distributions are shown below.
#COLLEGE MAJORS WITH HIGHEST PROPORTION OF WOMEN (TOP WOMEN'S COLLEGE MAJORS)
knitr::opts_chunk$set(fig.width=13, fig.height=8)
df = read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/women-stem.csv')
#Convert ShareWomen to a percentage and create ShareMen column (percentage of men by major)
df$ShareWomen = with(df,(ShareWomen*100))
df$ShareMen = with(df,(Men/ Total)*100)
mostwomen = top_n(df, n=10,ShareWomen)
mostwomen = mostwomen[c("Major", "ShareWomen", "ShareMen")]
names(mostwomen)[2] <- "Female Students"
names(mostwomen)[3] <- "Male Students"
print(mostwomen)
## Major
## 1 NURSING
## 2 NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES
## 3 MEDICAL TECHNOLOGIES TECHNICIANS
## 4 MEDICAL ASSISTING SERVICES
## 5 MISCELLANEOUS HEALTH MEDICAL PROFESSIONS
## 6 HEALTH AND MEDICAL ADMINISTRATIVE SERVICES
## 7 NUTRITION SCIENCES
## 8 COMMUNITY AND PUBLIC HEALTH
## 9 GENERAL MEDICAL AND HEALTH SERVICES
## 10 COMMUNICATION DISORDERS SCIENCES AND SERVICES
## Female Students Male Students
## 1 89.60190 10.398101
## 2 75.04726 24.952741
## 3 75.39274 24.607264
## 4 92.78072 7.219275
## 5 88.12939 11.870611
## 6 76.44265 23.557347
## 7 86.44561 13.554392
## 8 79.20953 20.790474
## 9 77.45766 22.542338
## 10 96.79981 3.200188
#COLLEGE MAJORS WITH HIGHEST PROPORTION OF MEN (TOP MEN'S COLLEGE MAJORS)
mostmen = top_n(df, n=10,ShareMen)
mostmen = mostmen[c("Major", "ShareWomen", "ShareMen")]
names(mostmen)[2] <- "Female Students"
names(mostmen)[3] <- "Male Students"
print(mostmen)
## Major Female Students
## 1 PETROLEUM ENGINEERING 12.056434
## 2 MINING AND MINERAL ENGINEERING 10.185185
## 3 METALLURGICAL ENGINEERING 15.303738
## 4 NAVAL ARCHITECTURE AND MARINE ENGINEERING 10.731320
## 5 NUCLEAR ENGINEERING 14.496697
## 6 MECHANICAL ENGINEERING 11.955890
## 7 AEROSPACE ENGINEERING 13.979280
## 8 ENGINEERING AND INDUSTRIAL MANAGEMENT 17.412251
## 9 MATHEMATICS AND COMPUTER SCIENCE 17.898194
## 10 MECHANICAL ENGINEERING RELATED TECHNOLOGIES 7.745303
## Male Students
## 1 87.94357
## 2 89.81481
## 3 84.69626
## 4 89.26868
## 5 85.50330
## 6 88.04411
## 7 86.02072
## 8 82.58775
## 9 82.10181
## 10 92.25470
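In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()`. A minimal sketch of the same subsetting step in that style follows, assuming a recent dplyr is installed. The `toy` data frame is hypothetical and only stands in for the majors table. Unlike `top_n()`, which keeps the original row order, `slice_max()` returns the selected rows sorted from highest to lowest.

```r
library(dplyr)

# Hypothetical toy data standing in for the majors table (values are made up)
toy <- data.frame(
  Major      = c("M1", "M2", "M3", "M4"),
  ShareWomen = c(20, 80, 50, 95),
  ShareMen   = c(80, 20, 50, 5)
)

top_women <- toy %>%
  # Keep the 2 rows with the largest ShareWomen, sorted in descending order
  slice_max(order_by = ShareWomen, n = 2) %>%
  # Select columns by name and relabel them in one step
  select(Major, `Female Students` = ShareWomen, `Male Students` = ShareMen)

print(top_women)
```

Selecting columns by name rather than by position also keeps the step working if the column order of the source file changes.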
#TOP WOMEN'S COLLEGE MAJORS PLOT
p3 = ggplot(data = mostwomen %>% gather(Variable, value, -Major),
aes(x = reorder(Major, value), y = value, fill = Variable, some_dummy_mapping = value[Variable])) + geom_bar(stat = 'identity', position = 'stack')
p3 = p3 + theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
p3 = p3 + xlab("Major") + ylab("Percent") + ggtitle("TOP COLLEGE MAJORS FEMALE STUDENTS") +
geom_text(aes(label=sprintf("%0.2f", round(value, digits = 2))), position= position_stack(vjust = 0.8), size=3.5) +
scale_fill_brewer(palette="Blues") + theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) + theme(legend.title = element_blank()) + coord_flip()
p3 = p3 + theme(plot.title = element_text(size=14, face="bold.italic"), axis.title.x = element_text(face="bold"),
axis.title.y = element_text(face="bold"))
p3 = ggplotly(p3, tooltip = c("value")) %>% as_widget()
p3
#TOP MEN'S COLLEGE MAJORS PLOT
p4 = ggplot(data = mostmen %>% gather(Variable, value, -Major),
aes(x = reorder(Major, value), y = value, fill = Variable, some_dummy_mapping = value[Variable])) + geom_bar(stat = 'identity', position = 'stack')
p4 = p4 + theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
p4 = p4 + xlab("Major") + ylab("Percent") + ggtitle("TOP COLLEGE MAJORS MALE STUDENTS") +
geom_text(aes(label=sprintf("%0.2f", round(value, digits = 2))), position= position_stack(vjust = 0.8), size=3.5) +
scale_fill_brewer(palette="Blues") + theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) + theme(legend.title = element_blank()) + coord_flip()
p4 = p4 + theme(plot.title = element_text(size=14, face="bold.italic"), axis.title.x = element_text(face="bold"),
axis.title.y = element_text(face="bold"))
p4 = ggplotly(p4, tooltip = c("value")) %>% as_widget()
p4
There has been a large societal push in recent years to encourage and promote women in STEM (Science, Technology, Engineering, and Mathematics), as research has found that many barriers can impede women’s progress in these fields. These barriers include gender stereotypes and the overall environment of science and engineering departments in colleges and universities. As the plots above indicate, women make up a large proportion of students in scientific fields related to health and medicine. However, more remains to be done to increase women’s participation in engineering, mathematics, and computer science, as these fields are dominated by men according to this data.